Prognosis of COVID‐19 patients using lab tests: A data mining approach

Abstract Background The rapid spread of coronavirus disease 2019 (COVID‐19) has caused a worldwide pandemic and affected the lives of millions. The potential fatality of the disease has raised global public health concerns. Alongside clinical practice, artificial intelligence (AI) offers a new approach to the early diagnosis and prediction of disease based on machine learning (ML) algorithms. In this study, we aimed to build a prediction model for the prognosis of COVID‐19 patients using data mining techniques. Methods A data set was obtained from the intelligent management system repository of 19 hospitals at Shahid Beheshti University of Medical Sciences in Iran. All admitted patients had positive polymerase chain reaction (PCR) test results and were hospitalized between February 19 and May 12, 2020. The extracted data set has 8621 data instances. The data include demographic information and the results of 16 laboratory tests. In the first stage, the data were preprocessed. Then, among 15 laboratory tests, four were selected. Models were created with seven data mining algorithms, and their performances were compared. Results The Random Forest (RF) and Gradient Boosted Trees models performed best, with accuracies of 86.45% and 84.80%, respectively. In contrast, the Decision Tree had the lowest accuracy (75.43%) among the seven models. Conclusion Data mining methods have the potential to predict the outcomes of COVID‐19 patients from lab tests and demographic features. After validation, these methods could be implemented in clinical decision support systems to improve the management and care of patients with severe COVID‐19.

COVID-19 intensive care unit (ICU) to develop, evaluate and validate various ML models for predicting the prognosis of COVID-19 patients. 11 The combination of image datasets and ML has also helped in the diagnosis of COVID-19. Muhammad et al. used chest X-ray images and ML models to extract the image features of 13 In another study, Gumaei et al. used time-series data on the number of people with COVID-19 worldwide and tried to predict the number of patients with the disease using different ML models. 14 Research has also been done on drug development in this area. Using ML techniques, Jamshidi et al. presented a framework based on deep learning (DL) methods to illustrate how AI can accelerate the process of drug development. This framework includes eight layers, which are responsible for identifying, analyzing, and predicting the drug's performance at different stages. 15,16 The use of ML models has advanced to the point that it were monitored for more than 15 days, and models were built from the baseline data of two groups, the severe and nonsevere groups. They used a nomogram for predictions of severe meta-classifier, with an accuracy of 84.21%, is the most reliable classifier to predict positive and negative COVID-19 instances. 19 During the COVID-19 outbreak, one of the greatest challenges for humanity was the proper management of and response to this disease. 20 According to the literature, ML can be useful in COVID-19 research, diagnosis, and prediction. 21 Therefore, for quick and effective prediction in diagnosing COVID-19 patients, two stages can be used: feature selection and COVID-19 diagnosis. 19 Success in combating such epidemics depends heavily on building an arsenal of platforms, methods, approaches, and tools that converge to achieve desired goals and make life more satisfying. 22
Although, as mentioned, many studies have been performed worldwide using different ML models, there is still a need to develop and evaluate these models with other datasets. 17 The motivation and contribution of this study are to develop predictive models that determine the outcome of COVID-19 patients using ML methods with a novel feature selection algorithm. These models were intended to be trained on data from Iranian hospitals so that they could help clinicians determine the prognosis of COVID-19 patients from lab tests.
Thus, this study aimed to propose a prediction model for the prognosis of COVID-19 patients (whether the patient will survive or die) using data mining techniques based on an Iranian data set of COVID-19 patients. The research was organized in the following steps. The methods section covers data set collection, preprocessing, feature selection, and modeling and evaluation. Section 3 presents the main findings, including the results of feature selection, the evaluation of the data mining models, and the comparative indicators and diagrams for the built models. Section 4 compares our findings with those of similar studies. Finally, after the study's limitations are stated, general conclusions are given along with suggestions for future research.

| METHODS
The methods used in this research are consistent with the related guidelines. The steps of the research are shown in Figure 1. Overall, the method includes data set collection, data set preprocessing, feature selection, and modeling and evaluation, which are described in the following sections.

| Data set collection
The data set was obtained from the Hospital Intelligent Management system repository. The data types of the features are listed in Table 1.

| Data set preprocessing
Preprocessing is a necessary step in producing an efficient classification model, and it affects the performance of ML methods. 21 In the first step, duplicate records were identified and removed based on the patients' national identification codes. The label (survived or dead) was transformed to binomial values in the data set.
In the next step, the data set containing the patients' lab results (Table 2) was converted to a columnar data set with one patient per row and all test results and the discharge state as columns.
Due to the differences in the lab tests for each patient, the resulting data set was sparse.
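The conversion from one row per lab result to one row per patient can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the tuple layout, the test names, and the "dead"/"survived" label strings are assumptions made for the example.

```python
def pivot_lab_results(rows):
    """Turn long-format lab rows into one wide record per patient.

    rows: list of (patient_id, test_name, value, outcome) tuples
          (hypothetical layout, assumed for this sketch).
    Returns (wide, test_names), where wide maps patient_id -> dict.
    Tests a patient never had stay None, which is why the table is sparse.
    """
    test_names = sorted({test for _, test, _, _ in rows})
    wide = {}
    for patient_id, test, value, outcome in rows:
        record = wide.setdefault(
            patient_id,
            {"outcome": 1 if outcome == "dead" else 0,
             **{name: None for name in test_names}},
        )
        # Keep the first result seen for a test; later duplicates are ignored.
        if record[test] is None:
            record[test] = value
    return wide, test_names


rows = [
    ("p1", "test_a", 1.0, "dead"),
    ("p1", "test_b", 2.0, "dead"),
    ("p2", "test_a", 3.0, "survived"),
    ("p1", "test_a", 9.0, "dead"),  # duplicate test for p1, ignored
]
wide, test_names = pivot_lab_results(rows)
```

Here patient `p2` has no value for `test_b`, so that cell remains empty in the wide table.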

| Feature selection
The feature selection process is shown in Figure 2. The process includes an independent t test, features subset calculation, and feature subset selection that are described below.
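As an illustration of the first step, the statistic behind an independent two-sample t test can be computed as below. The text does not say whether equal variances were assumed, so this sketch uses Welch's unequal-variance form as an assumption; it returns only the statistic and the degrees of freedom, and a statistics library would be needed to turn them into a p value.

```python
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t, degrees_of_freedom) for two independent samples."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard errors of the two sample means.
    se_a = variance(sample_a) / n_a
    se_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / (se_a + se_b) ** 0.5
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical values of one lab test in the two outcome groups.
t_stat, dof = welch_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
```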
FIGURE 1 Steps of conducting the research

| Feature subset selection
The list of feature subsets and their scores is sorted, and the subset with the most features and a score of more than 1000 is chosen for the final data set (that is, the largest subset for which more than 1000 records have nonmissing values for all of its features; Figure 3).
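One way to read this selection rule is sketched below, assuming that the score of a subset is the number of records with no missing value in any of its features. The function name, the record layout, and the tie-breaking rule (higher score wins among subsets of equal size) are assumptions of the example, and the exhaustive enumeration is only practical for a small number of candidate features.

```python
from itertools import combinations


def select_feature_subset(records, features, min_complete=1000):
    """Pick the largest feature subset with more than min_complete complete rows.

    records: list of dicts mapping feature name -> value, or None if missing.
    Returns (subset, score), or (None, 0) if no subset qualifies.
    """
    best_subset, best_score = None, 0
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            # Score = number of records with no missing value in the subset.
            score = sum(
                all(rec.get(name) is not None for name in subset)
                for rec in records
            )
            if score <= min_complete:
                continue
            # Prefer more features; among equal sizes, prefer the higher score.
            if (best_subset is None or size > len(best_subset)
                    or score > best_score):
                best_subset, best_score = subset, score
    return best_subset, best_score


toy_records = [
    {"a": 1, "b": 2, "c": None},
    {"a": 1, "b": 2, "c": 3},
    {"a": 1, "b": None, "c": None},
    {"a": 1, "b": 5, "c": None},
]
# With a toy threshold of 1, ("a", "b") is the largest subset that qualifies.
subset, score = select_feature_subset(toy_records, ["a", "b", "c"], min_complete=1)
```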

| Modeling and evaluation
Logistic Regression, Gradient Boosted Trees, Naive Bayes, Decision Tree, Support Vector Machine, Generalized Linear Model, and Random Forest models were generated and evaluated using 10-fold cross-validation in RapidMiner Studio. Creating and assessing each model consists of feature selection and optimization of the model parameters on the training data (training phase), followed by evaluation of the model on the test data (testing phase). In addition, the synthetic minority oversampling technique (SMOTE) 23 was applied to balance the training data. Cross-validation divides the data set into 10 nonoverlapping folds.
Each of the 10 folds is used in turn as the test set, while the remaining folds are used together for training. A total of 10 models are fit and evaluated on the 10 hold-out test sets, and the mean performance is reported.
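The fold logic can be sketched in plain Python as follows. This is not the RapidMiner process used in the study: `fit`, `predict`, and `resample` are hypothetical callables supplied by the caller, and `resample` only marks the place where an oversampler such as SMOTE would be applied to the training folds (no oversampler is implemented here).

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the indices and split them into k nonoverlapping folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(X, y, fit, predict, k=10, resample=None, seed=0):
    """Return the mean accuracy over k hold-out folds.

    fit(X_train, y_train) -> model; predict(model, X_test) -> list of labels.
    resample, if given, is applied to the training folds only, so the
    held-out fold keeps its original class distribution.
    """
    scores = []
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        if resample is not None:
            X_train, y_train = resample(X_train, y_train)
        model = fit(X_train, y_train)
        predictions = predict(model, [X[i] for i in test_idx])
        correct = sum(p == y[i] for p, i in zip(predictions, test_idx))
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)


def fit_majority(X_train, y_train):
    """Toy learner: remember the most frequent training label."""
    return max(set(y_train), key=y_train.count)


def predict_majority(model, X_test):
    return [model] * len(X_test)


toy_X = list(range(20))
toy_y = [0] * 18 + [1] * 2
mean_accuracy = cross_validate(toy_X, toy_y, fit_majority, predict_majority)
```

Applying the resampling step inside the loop, rather than once before splitting, keeps synthetic samples out of the test folds.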

| Logistic regression (LR)
LR is a type of regression analysis used in statistics to predict the outcome of a categorical dependent variable from a set of predictor (independent) variables. 24 LR can analyze the relationship between a categorical dependent variable and independent factors of any kind (categorical, continuous, or binary). 25 When the dependent variable has two values (0 and 1, or yes and no), the method is referred to as binary logistic regression. 26
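For reference, the standard form of the binary logistic regression model is given below. The symbols are generic notation introduced here, not taken from the study: Y is the binary outcome, x_1, ..., x_p are the predictors, and the beta terms are the fitted coefficients.

```latex
\[
P(Y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)}},
\qquad
\ln\frac{P(Y = 1 \mid x)}{1 - P(Y = 1 \mid x)} = \beta_0 + \sum_{j=1}^{p} \beta_j x_j
\]
```

The second expression shows that the log-odds of the outcome are linear in the predictors.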

| Naive Bayes (NB)
NB is a subdivision of Bayesian decision theory. It is called naive because its formulation makes some naive assumptions, yet it can classify documents astoundingly well. 27 NB is one of the simplest probabilistic classifiers. 28,29 The classifier simplifies the learning process by assuming that features are independent given the class. 28 The resulting classifier is remarkably successful in practice, often competing with more sophisticated techniques. 26 NB is efficient in several practical applications, such as text classification and medical diagnosis. 27
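The independence assumption can be made concrete with a small Gaussian naive Bayes sketch. The Gaussian likelihood is an assumption chosen here for continuous inputs such as lab values; the text does not state which NB variant was used, and the function names are illustrative.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate the class prior and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        stats = []
        for column in zip(*rows):
            mu = sum(column) / len(column)
            var = sum((v - mu) ** 2 for v in column) / len(column)
            stats.append((mu, var + 1e-9))  # small floor avoids division by zero
        model[label] = (len(rows) / len(X), stats)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior for one row."""
    def log_posterior(label):
        prior, stats = model[label]
        total = math.log(prior)
        # Independence given the class: per-feature log-likelihoods simply add.
        for value, (mu, var) in zip(row, stats):
            total += -0.5 * math.log(2 * math.pi * var) - (value - mu) ** 2 / (2 * var)
        return total
    return max(model, key=log_posterior)


toy_X = [[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8]]
toy_y = [0, 0, 1, 1]
nb_model = fit_gaussian_nb(toy_X, toy_y)
label = predict_gaussian_nb(nb_model, [1.1, 1.0])
```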

| Support vector machine (SVM)
An SVM is used for analyzing data and discovering patterns in classification and regression analysis. 24 As a powerful tool for data classification, this model classifies points from two categories by assigning them to one of two disjoint half-spaces, either in the original input space for linear classifiers or in a higher-dimensional feature space for nonlinear classifiers. 26 The larger the margin between the two classes, the better the model.
Also, SVM works much better on datasets with many attributes. 24
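The idea of separating two classes by the widest possible gap can be written in the standard hard-margin form below. The notation is generic and assumed here: x_i are the training points, y_i in {-1, +1} their labels, and (w, b) define the separating hyperplane.

```latex
\[
f(x) = \operatorname{sign}(w^{\top} x + b), \qquad
\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \,(w^{\top} x_i + b) \ge 1, \quad i = 1, \dots, n
\]
```

The distance between the two supporting hyperplanes is 2/||w||, so minimizing ||w|| maximizes the margin.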

| Gradient boosted trees
In the gradient boosted trees method, additional trees are added sequentially, each one correcting the mistakes made by the previous models, which tends to increase prediction accuracy. Gradient boosting of regression trees produces competitive, robust, and interpretable procedures for regression and classification that are especially suitable for mining less-than-clean data. 30 Boosting algorithms are relatively easy to implement and allow for experimentation with various model designs. Gradient boosting machines (GBMs) have shown considerable success in practical applications and in data mining and ML challenges. 27
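The "each tree corrects the previous ones" idea can be illustrated with a deliberately small sketch: one-split regression stumps fit to the residuals of the current ensemble under squared-error loss. This is a simplification chosen for the example, not the configuration used in the study, and all names and parameter values are illustrative.

```python
def fit_stump(X, residuals):
    """Find the single-feature threshold split that minimizes squared error."""
    best = None
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[j] > threshold]
            if not left or not right:
                continue
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            error = (sum((r - left_mean) ** 2 for r in left)
                     + sum((r - right_mean) ** 2 for r in right))
            if best is None or error < best[0]:
                best = (error, j, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("no valid split: every feature is constant")
    return best[1:]


def stump_predict(stump, row):
    j, threshold, left_mean, right_mean = stump
    return left_mean if row[j] <= threshold else right_mean


def fit_gradient_boosting(X, y, n_trees=20, learning_rate=0.5):
    """Fit stumps one after another, each on the residuals of the ensemble."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [target - pred for target, pred in zip(y, predictions)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        predictions = [pred + learning_rate * stump_predict(stump, row)
                       for pred, row in zip(predictions, X)]
    return base, learning_rate, stumps


def boosted_predict(model, row):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, row) for s in stumps)


gb_model = fit_gradient_boosting([[1.0], [2.0], [3.0], [4.0]], [0.0, 0.0, 1.0, 1.0])
score = boosted_predict(gb_model, [4.0])  # approaches 1.0 as trees are added
```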

| Decision tree (DT)
A DT predicts a response from data by classification or regression and is one of the primary data mining methods. If the target variable is categorical, classification is used; if it is continuous, regression is used. A DT is constructed of a root node, leaf nodes, and branches. Data are evaluated by following the path from the root node to a leaf node. 24 A tree is built in two phases: tree-growing (building) and tree-pruning.
In the first phase, the algorithm begins with the entire data set at the root node; the data set is split into subsets, and the splitting is repeated for each subset until each subset becomes sufficiently small. In the second phase (tree-pruning), the full tree is cut back to avoid overfitting and thereby improve the accuracy of the tree. 31
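A single step of the tree-growing phase, choosing where to split a node, can be sketched as below. The Gini impurity criterion is an assumption of this example (the text does not name the splitting criterion used), and pruning is not shown.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(X, y):
    """Return (feature_index, threshold, weighted_gini) of the best binary split."""
    best = None
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            left = [label for row, label in zip(X, y) if row[j] <= threshold]
            right = [label for row, label in zip(X, y) if row[j] > threshold]
            if not left or not right:
                continue
            # Impurity of the two children, weighted by their sizes.
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[2]:
                best = (j, threshold, score)
    return best


toy_X = [[1.0, 7.0], [2.0, 3.0], [3.0, 8.0], [4.0, 2.0]]
toy_y = [0, 0, 1, 1]
split = best_split(toy_X, toy_y)  # feature 0 at threshold 2.0 separates the classes
```

Growing a full tree repeats this search on each resulting subset until the subsets are small or pure.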

| Generalized linear model (GLM)
The GLM provides a comprehensive and widely used method for statistical analysis. In particular, its predictive ability can be valuable for assessing the practical importance of the predictors and for comparing competing GLMs. 32 These models are easy to interpret, and the methods are theoretically well understood and explained. 33 The GLM extends the concept of the linear regression model. 34 The term GLM came from Nelder and Wedderburn 35 and McCullagh and Nelder. 36 They reported that if the distribution of the dependent variable Y is a member of the exponential class, the GLM can be specified by two components: the distribution of Y and the link function. 37

| Random forest (RF)
As an ensemble learning method, RF, or random decision forest, can be used for tasks such as classification and regression. This model can easily be applied to large-scale problems, can be adapted to different ad hoc learning tasks, and returns variable importance measures. 39 Although RF performs better than decision trees, its accuracy is lower than that of gradient-boosted trees. That said, data characteristics can affect their performance. 40

| Model evaluation
The performance of the same steps without the feature selection phase can also be evaluated. Sensitivity and specificity are two key factors that determine the validity of a model. 41 Thus, the accuracy, sensitivity, and specificity of these models were evaluated using 10-fold cross-validation and Formulas 1-3:

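$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (1)$$
$$\text{Sensitivity} = \frac{TP}{TP + FN} \quad (2)$$
$$\text{Specificity} = \frac{TN}{TN + FP} \quad (3)$$
where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively.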
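The three evaluation measures can be computed directly from confusion-matrix counts, as in the sketch below. It uses the standard definitions of accuracy, sensitivity, and specificity; the function name and the example counts are illustrative and are not results from the study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, sensitivity, and specificity from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # share of actual positives detected
        "specificity": tn / (tn + fp),  # share of actual negatives detected
    }


# Made-up counts for illustration only.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

In a 10-fold setting, these values would be computed on each hold-out fold and then averaged.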

3 | RESULTS

| Feature selection
The p values from the independent t tests of the lab tests are shown in Table 3.
Of these lab tests shown in Table 3 of the receiver operating characteristic curve of the models can be seen in Figure 4.

| DISCUSSION
This study examined the performance of seven classification models in predicting COVID-19 mortality.